

Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

Neural Information Processing Systems

Despite the extensive application of multi-pass SGD in practice, only a few theoretical techniques have been developed to study its generalization.





Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

Neural Information Processing Systems

Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most existing generalization analyses concern single-pass SGD, a less practical variant than the commonly used multi-pass SGD. Moreover, theoretical analyses of multi-pass SGD often address a worst-case instance in a class of problems, which may be too pessimistic to explain the superior generalization ability on a particular problem instance. The goal of this paper is to provide an instance-dependent excess risk bound of multi-pass SGD for least squares in the interpolation regime, expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that SGD always generalizes worse than GD, instance-wise. On the other hand, we show that although SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer stochastic gradient evaluations and is therefore preferable in terms of computational time.
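The setting above can be reproduced in a toy experiment. The following is a minimal sketch (illustrative only, not the paper's analysis): a noiseless overparameterized least-squares problem where both multi-pass SGD and full-batch GD drive the training loss to (near) zero, i.e. both interpolate; the paper's contribution is comparing their excess risks at this point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # interpolation regime: d > n, noiseless labels
X = rng.normal(size=(n, d)) / np.sqrt(d)
w_star = rng.normal(size=d)
y = X @ w_star                       # an exact interpolator exists

def multipass_sgd(passes, lr=0.5):
    w = np.zeros(d)
    for _ in range(passes):          # one pass = one shuffled epoch over the data
        for i in rng.permutation(n):
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

def gd(steps, lr=2.0):               # full-batch gradient descent
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / n
    return w

train_loss = lambda w: np.mean((X @ w - y) ** 2)
w_sgd, w_gd = multipass_sgd(200), gd(1000)
# both reach near-zero training loss (interpolation); the paper then
# compares their excess risks, which differ by a fluctuation error term
```

The learning rates and iteration counts here are ad hoc choices for this synthetic instance, not the schedules analyzed in the paper.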






Improved Scaling Laws in Linear Regression via Data Reuse

Lin, Licong, Wu, Jingfeng, Bartlett, Peter L.

arXiv.org Machine Learning

Neural scaling laws suggest that the test error of large language models trained online decreases polynomially as the model size and data size increase. However, such scaling becomes unsustainable when new data runs out. In this work, we show that data reuse can improve existing scaling laws in linear regression. Specifically, we derive sharp test error bounds on $M$-dimensional linear models trained by multi-pass stochastic gradient descent (multi-pass SGD) on $N$ data with sketched features. Assuming that the data covariance has a power-law spectrum of degree $a$, and that the true parameter follows a prior with an aligned power-law spectrum of degree $b-a$ (with $a > b > 1$), we show that multi-pass SGD achieves a test error of $Θ(M^{1-b} + L^{(1-b)/a})$, where $L \lesssim N^{a/b}$ is the number of iterations. In the same setting, one-pass SGD only attains a test error of $Θ(M^{1-b} + N^{(1-b)/a})$ (see e.g., Lin et al., 2024). This suggests an improved scaling law via data reuse (i.e., choosing $L > N$) in data-constrained regimes. Numerical simulations are also provided to verify our theoretical findings.


Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization

Vansover-Hager, Shira, Koren, Tomer, Livni, Roi

arXiv.org Machine Learning

We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $Θ(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $η = Θ(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $Ω(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $Θ(1/(ηT) + η\sqrt{T})$, where $T$ is the total number of steps. These results reveal a phase transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after $O(n \log n)$ steps. Finally, we also prove a lower bound of $Ω(η\sqrt{n})$ on the generalization gap of one-pass SGD in dimension $d = \smash{\widetilde O}(n)$, improving on recent results of Koren et al. (2022) and Schliserman et al. (2024).
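The stated $Θ(1/(ηT) + η\sqrt{T})$ rate can be evaluated directly to see the phase transition: with $η = 1/\sqrt{n}$, the $η\sqrt{T}$ term is already of constant order during the second epoch, and it grows as training continues at that step size. A short arithmetic check (plugging numbers into the stated bound, with an arbitrary $n$):

```python
import math

n = 10_000
eta = 1 / math.sqrt(n)     # step size giving the optimal rate after one pass

def bound(T):
    """Order of the population loss from the second pass onward."""
    return 1 / (eta * T) + eta * math.sqrt(T)

for epochs in [2, 10, 100]:
    T = epochs * n
    print(f"{epochs:>3} epochs: 1/(ηT) + η√T ≈ {bound(T):.3f}")
```

At $T = 2n$ the $η\sqrt{T}$ term is already $\sqrt{2} > 1$, i.e. $Ω(1)$ overfitting after one extra pass; keeping $η = 1/\sqrt{n}$ fixed, the bound only worsens with more steps.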